RNA-Seq experiment on six DC populations either on C57Black6 mice vs C57black6 mice with STAT6 knockout. DC populations are defined by CD11b, CD103, CD24 status from sorting.
There are many steps involved in analysing an RNA-Seq experiment. Analysing an RNAseq experiment begins with sequencing reads. These are aligned to a reference genome, then the number of reads mapped to each gene can be counted. This results in a table of counts, which is what we perform statistical analyses on in R. While mapping and counting are important and necessary tasks, in this report we will be starting from the count data and getting stuck into analysis.
The entirety of this dataset was downloaded from a Globus link provided by Otago Genomics who performed the RNA-Sequencing on the 5th October 2019. The unique ID for the sequencing was OG4953 with 36 samples labelled split across three sequencing lanes. The 36 samples were comprised of three replicates of 12 cell populations. Six populations were labeled as B6, the other six as KO. The six cell populations were labeled as 24p, 24n, 103p, 103n, 11bp, and TN.
I used Trimmomatic v0.36 to trim our RNA-Seq reads. I removed Illumina adapters, base pairs with < 30 phred quality score, and reads less than 20 bp in length.
https://academic.oup.com/bioinformatics/article/30/15/2114/2390096
Anthony M. Bolger, Marc Lohse, Bjoern Usadel, Trimmomatic: a flexible trimmer for Illumina sequence data, Bioinformatics, Volume 30, Issue 15, 1 August 2014, Pages 2114–2120, https://doi.org/10.1093/bioinformatics/btu170
For reproducibility, we used the publicly available Docker comics/trimmomatic which was currently running trimmomatic 0.36 using the following code
# Concatenate all RNA-Seq exeperiment fastq files
for x in 1 2 3 4 5 6 7 8 9;
do
echo merging $x
cat CDT11ANXX-4953-0$x-50-1_S$x_L00*_R1_001.fastq.gz > combined/4953_S0${x}_R1.fastq.gz
cat CDT11ANXX-4953-0$x-50-1_S$x_L00*_R2_001.fastq.gz > combined/4953_S0${x}_R2.fastq.gz
done
for x in 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36;
do
echo merging $x;
cat CDT11ANXX-4953-$x-50-1_S$x_L00*_R1_001.fastq.gz > combined/4953_S${x}_R1.fastq.gz
cat CDT11ANXX-4953-$x-50-1_S$x_L00*_R2_001.fastq.gz > combined/4953_S${x}_R2.fastq.gz
done
# Trimmomatic 0.36 docker and code
docker run -v /stornext/BioinfoSAN/Bioinformatics/Illumina/OG4953/OG4953_fastq/combined:/data/ -it comics/trimmomatic:0.36 bash
ADAPTfile=/software/applications/Trimmomatic/0.36/adapters/TruSeq3-PE-2.fa;
for x in 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36;
echo trimming sample $x;
java -jar $TRIMMOMATIC PE -threads 12 4953_S${x}_R1.fastq.gz 4953_S${x}_R2.fastq.gz -baseout trimmed/4953_S${x}.fq.gz ILLUMINACLIP:${ADAPTfile}:2:15:8:5:true LEADING:3 TRAILING:3 SLIDINGWINDOW:10:20 MINLEN:20;
done
We used STAR 2.7.1a implemented in an in-house Docker container to map our reads to the M16 mm10 mouse genome.
https://github.com/alexdobin/STAR
https://academic.oup.com/bioinformatics/article/29/1/15/272537
STAR: ultrafast universal RNA-seq aligner. Alexander Dobin, Carrie A. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, Thomas R. Gingeras. Bioinformatics, Volume 29, Issue 1, January 2013, Pages 15–21, https://doi.org/10.1093/bioinformatics/bts635
We aligned to Gencode mouse primary assembly M16 https://www.gencodegenes.org/mouse/release_M16.html
# STAR 2.7.1a docker and code
docker run -v /stornext/BioinfoSAN/Bioinformatics/Illumina/OG4953/OG4953_fastq/combined/trimmed:/data/ -it samold_staraligner/star2.7.1a:latest bash
STAR --runThreadN 12 -- runMode genomeGenerate --genomeDir /data/STAR_Mapped/star_genome_directory --genomeFastaFiles /data/STAR_Mapped/star_genome/GRCm38.primary_assembly.genome.fa --sjdbGTFfile /data/STAR_Mapped/star_genome/gencode.vM16.primary_assembly.annotation.gff3
for x in 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36;
do echo mapping $x;
STAR --runThreadN 12 --genomeDir /data/STAR_mapped/star_genome_directory --readFilesIn /data/4953_S${x}_1P.fq.gz /data/4953_S${x}_2P.fq.gz --readFilesCommand gunzip -c --outFileNamePrefix /data/STAR_mapped/output/${x}_STAR_9.Oct.19_
done
featureCounts 1.28.1 on R 3.4.4 “Someone to lean on” was used to count the mapped reads directly from the SAM file output from STAR.
https://www.ncbi.nlm.nih.gov/pubmed/24227677
> citation("Rsubread")
Please cite Rsubread as below:
Liao Y, Smyth GK and Shi W (2013). The Subread aligner: fast,
accurate and scalable read mapping by seed-and-vote. Nucleic Acids
Research, 41(10):e108.
A BibTeX entry for LaTeX users is
@Article{,
title = {The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote},
author = {Yang Liao and Gordon K. Smyth and Wei Shi},
journal = {Nucleic Acids Research},
year = {2013},
volume = {41},
issue = {10},
pages = {e108},
}
> R.Version()
$platform
[1] "x86_64-pc-linux-gnu"
$arch
[1] "x86_64"
$os
[1] "linux-gnu"
$system
[1] "x86_64, linux-gnu"
$status
[1] ""
$major
[1] "3"
$minor
[1] "4.4"
$year
[1] "2018"
$month
[1] "03"
$day
[1] "15"
$`svn rev`
[1] "74408"
$language
[1] "R"
$version.string
[1] "R version 3.4.4 (2018-03-15)"
$nickname
[1] "Someone to Lean On"
> sessionInfo(
+ )
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.6 LTS
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.0
LAPACK: /usr/lib/lapack/liblapack.so.3.0
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Rsubread_1.28.1
loaded via a namespace (and not attached):
[1] compiler_3.4.4 tools_3.4.4
I created a Docker to use featureCounts in a reproducible manner.
# Rsubread
docker run -v /stornext/BioinfoSAN/Bioinformatics/Illumina/OG4953/OG4953_fastq/combined/trimmed/STAR_mapped:/data/ -it rsubread/afterstar:latest R
library(Rsubread)
sampleVector = c("01","02","03","04","05","06","07","08","09","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24","25","26","27","28","29","30","31","32","33","34","35","36")
samFileNames=NULL;
for(sampleName in sampleVector){
samFileNames = c(samFileNames, paste0("/data/output/", sampleName, "_STAR_9.Oct.19_Aligned.out.sam"))
}
FCFile = featureCounts(nthreads = 12,
primaryOnly = TRUE,
isPairedEnd = TRUE,
checkFragLength = TRUE,
files = samFileNames,
GTF.featureType = "gene",
GTF.attrType = "gene_name",
isGTFAnnotationFile = TRUE,
allowMultiOverlap=F,
countMultiMappingReads = T,
annot.ext = "/data/star_genome/gencode.vM16.primary_assembly.annotation.gff3")
write.csv(FCFile$counts, file="/data/output/RNASeq_raw_counts_STAR_FR_2019-10-09.csv")
write.csv(FCFile$stat, file="/data/output/FCFile$stat_STAR_FR_2019-10-09.csv")
I count multi-mapped reads because I align to the genome, and best practise reccomends that they are not discarded: “Genomic multireads are primarily due to repetitive sequences or shared domains of paralogous genes. They normally account for a significant fraction of the mapping output when mapped onto the genome and should not be discarded.”
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0881-8
A survey of best practices for RNA-seq data analysis Ana Conesa, Pedro Madrigal, Sonia Tarazona, David Gomez-Cabrero, Alejandra Cervera, Andrew McPherson, Michał Wojciech Szcześniak, Daniel J. Gaffney, Laura L. Elo, Xuegong Zhang & Ali Mortazavi Genome Biology Volume 17, Article number: 13 (2016)
Pearson correlation scores between replicates when top 1% outliers and transcripts with less than 5 reads (on average) are removed.
Pearson Correlation matrices between each sample with outliers removed. Outliers are the top 1% most reads transcripts, and transcripts with less than 5 reads. Mouseoverable plot.
Sample 14 is clearly too much of an outlier to even include in any Quality analysis, so I will remove Sample 14 from all analyses from now on.
Pearson Correlation matrices between each sample with outliers removed. Outliers are the top 1% most reads transcripts, and transcripts with less than 5 reads. Mouseoverable plot.
Spearman Correlation matrices between each sample with outliers removed. Outliers are the top 1% most reads transcripts, and transcripts with less than 5 reads. Mouseoverable plot.
Conclusion: Triple replicates are all strongly correlated with both Pearson & Spearman correlations > 0.9 R\(^2\) when top 1% outliers and transcripts less than 5 reads are removed. (These cutoffs are chosen from Immgen standards if citation needed).
This is the same PCA repeated 4 times to show different labels
I’ve provided in this document all code required for recalculation of results when the RNA-Seq data is made publicly available, with citations. Not a complete document, there is still analysis to be performed.